The Annals of Statistics 

2011, Vol. 39, No. 2, 1241-1265 

DOI: 10.1214/10-AOS870 

© Institute of Mathematical Statistics, 2011



SPARSE LINEAR DISCRIMINANT ANALYSIS BY 
THRESHOLDING FOR HIGH DIMENSIONAL DATA 

By Jun Shao¹, Yazhen Wang², Xinwei Deng and Sijian Wang

East China Normal University and University of Wisconsin 

In many social, economic, biological and medical studies, one
objective is to classify a subject into one of several classes based on
a set of variables observed from the subject. Because the probability
distribution of the variables is usually unknown, the rule of
classification is constructed using a training sample. The well-known linear
discriminant analysis (LDA) works well for the situation where the
number of variables used for classification is much smaller than the
training sample size. Because of advances in technology, modern
statistical studies often face classification problems with the number
of variables much larger than the sample size, and the LDA may
perform poorly. We explore when and why the LDA has poor
performance and propose a sparse LDA that is asymptotically optimal
under some sparsity conditions on the unknown parameters. To
illustrate the application, we discuss an example of classifying human
cancer into two classes of leukemia based on a set of 7,129 genes and
a training sample of size 72. A simulation is also conducted to check
the performance of the proposed method.

1. Introduction. The objective of a classification problem is to classify
a subject to one of several classes based on a $p$-dimensional vector $x$ of
characteristics observed from the subject. In most applications, variability
exists, and hence $x$ is random. If the distribution of $x$ is known, then we can
construct an optimal classification rule that has the smallest possible
misclassification rate. However, the distribution of $x$ is usually unknown, and
a classification rule has to be constructed using a training sample. A
statistical issue is how to use the training sample to construct a classification
rule that has a misclassification rate close to that of the optimal rule.

In traditional applications, the dimension $p$ of $x$ is fixed while the
training sample size $n$ is large. Because of advances in technology, a much
larger amount of information can now be collected, and the resulting $x$ is
of a high dimension. In many recent applications, $p$ is much larger than the
training sample size, which is referred to as the large-$p$-small-$n$ problem, or
the ultra-high dimension problem when $p = O(e^{n^\beta})$ for some $\beta \in (0,1)$.
An example is a study with genetic or microarray data. In our example presented
in Section 5, for instance, a crucial step for a successful chemotherapy
treatment is to classify human cancer into two classes of leukemia, acute myeloid
leukemia and acute lymphoblastic leukemia, based on $p = 7{,}129$ genes and
a training sample of 72 patients. Other examples include data from
radiology, biomedical imaging, signal processing, climate and finance. Although
more information is better when the distribution of $x$ is known, a larger
dimension $p$ produces more uncertainty when the distribution of $x$ is unknown
and, hence, results in a greater challenge for data analysis since the training
sample size $n$ cannot increase as fast as $p$.

The well-known linear discriminant analysis (LDA) works well for
fixed-$p$-large-$n$ situations and is asymptotically optimal in the sense that, when $n$
increases to infinity, the ratio of its misclassification rate to that of the
optimal rule converges to one. In fact, we show in this paper that the LDA is still
asymptotically optimal when $p$ diverges to infinity at a rate slower than $\sqrt{n}$. On
the other hand, Bickel and Levina (2004) showed that the LDA is
asymptotically as bad as random guessing when $p > n$; some similar results are
also given in this paper. The main purpose of this paper is to construct
a sparse LDA and show it is asymptotically optimal under some sparsity
conditions on unknown parameters and some condition on the divergence
rate of $p$ (e.g., $n^{-1}\log p \to 0$ as $n \to \infty$). Our proposed sparse LDA is based
on the thresholding methodology, which was developed in wavelet
shrinkage for function estimation [Donoho and Johnstone (1994), Donoho et al.
(1995)] and covariance matrix estimation [Bickel and Levina (2008)]. There
exist a few other sparse LDA methods, for example, Guo, Hastie and
Tibshirani (2007), Clemmensen, Hastie and Ersbøll (2008) and Qiao, Zhou and
Huang (2009). The key differences between the existing methods and ours
are the conditions on sparsity and the construction of sparse estimators of
parameters. However, no asymptotic results were established in the existing
papers.

For high-dimensional $x$ in regression, there exist some variable selection
methods [see a recent review by Fan and Lv (2010)]. For constructing a
classification rule using variable selection, we must identify not only components
of $x$ having mean effects for classification, but also components of $x$ having
effects for classification through their correlations with other components
[see, e.g., Kohavi and John (1997), Zhang and Wang (2010)]. This may be
a very difficult task when $p$ is much larger than $n$, such as $p = 7{,}129$ and
$n = 72$ in the leukemia example in Section 5. Ignoring the correlation, Fan
and Fan (2008) proposed the features annealed independence rule (FAIR),
which first selects $m$ components of $x$ having mean effects for classification
and then applies the naive Bayes rule (obtained by assuming that
components of $x$ are independent) using the selected $m$ components of $x$ only.
Although no sparsity condition on the covariance matrix of $x$ is required, the
FAIR is not asymptotically optimal because the correlation between
components of $x$ is ignored. Our approach is not a variable selection approach, that
is, we do not try to identify a subset of components of $x$ with a size smaller
than $n$. We use thresholding estimators of the mean effects as well as Bickel
and Levina's (2008) thresholding estimator of the covariance matrix of $x$,
but we allow the number of nonzero estimators (for the mean differences or
covariances) to be much larger than $n$ to ensure the asymptotic optimality
of the resulting classification rule.

The rest of this paper is organized as follows. In Section 2, after
introducing some notation and terminology, we establish a sufficient condition
on the divergence of $p$ under which the LDA is still asymptotically close
to the optimal rule. We also show that, when $p$ is large compared with $n$
($p/n \to \infty$), the performance of the LDA is not good even if we know the
covariance matrix of $x$, which indicates the need for sparse estimators of
both the mean difference and the covariance matrix. Our main result is given
in Section 3, along with some discussions about various sparsity conditions
and divergence rates of $p$ for which the proposed sparse LDA performs well
asymptotically. Extensions of the main result are discussed in Section 4. In
Section 5, the proposed sparse LDA is illustrated in the example of
classifying human cancer into two classes of leukemia, along with some simulation
results for examining misclassification rates. All technical proofs are given
in Section 6.

2. The optimal rule and linear discriminant analysis. We focus on the
classification problem with two classes. The general case with three or more
classes is discussed in Section 4. Let $x$ be a $p$-dimensional normal random
vector belonging to class $k$ if $x \sim N_p(\mu_k, \Sigma)$, $k = 1, 2$, where
$\mu_1 \neq \mu_2$, and $\Sigma$ is positive definite. The misclassification rate of any
classification rule is the average of the probabilities of making two types of
misclassification: classifying $x$ to class 1 when $x \sim N_p(\mu_2, \Sigma)$ and
classifying $x$ to class 2 when $x \sim N_p(\mu_1, \Sigma)$.

If $\mu_1$, $\mu_2$ and $\Sigma$ are known, then the optimal classification rule, that is, the
rule with the smallest misclassification rate, classifies $x$ to class 1 if and only
if $\delta'\Sigma^{-1}(x - \bar\mu) \geq 0$, where $\bar\mu = (\mu_1 + \mu_2)/2$, $\delta = \mu_1 - \mu_2$, and $a'$
denotes the transpose of the vector $a$. This rule is also the Bayes rule with equal prior
probabilities for two classes. Let $R_{\mathrm{OPT}}$ denote the misclassification rate of
the optimal rule. Using the normal distribution, we can show that

$$R_{\mathrm{OPT}} = \Phi(-\Delta_p/2), \qquad \Delta_p = \sqrt{\delta'\Sigma^{-1}\delta}, \tag{1}$$

where $\Phi$ is the standard normal distribution function. Although
$0 < R_{\mathrm{OPT}} < 1/2$, $R_{\mathrm{OPT}} \to 0$ if $\Delta_p \to \infty$ as $p \to \infty$ and
$R_{\mathrm{OPT}} \to 1/2$ if $\Delta_p \to 0$. Since 1/2
is the misclassification rate of random guessing, we assume the following
regularity conditions: there is a constant $c_0$ (not depending on $p$) such that

$$c_0^{-1} \leq \text{all eigenvalues of } \Sigma \leq c_0 \tag{2}$$

and

$$c_0^{-1} \leq \max_{j \leq p} \delta_j^2 \leq c_0, \tag{3}$$

where $\delta_j$ is the $j$th component of $\delta$. Under (2)-(3), $\Delta_p \geq c_0^{-1}$, and hence
$R_{\mathrm{OPT}} \leq \Phi(-(2c_0)^{-1}) < 1/2$. Also, $\Delta_p^2 = O(\|\delta\|^2)$ and
$\|\delta\|^2 = O(\Delta_p^2)$ so that the rate of $\|\delta\|^2 \to \infty$ is the same as the rate of
$\Delta_p^2 \to \infty$, where $\|a\|$ is the $L_2$-norm of the vector $a$.
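
As an illustration of the optimal rule and of formula (1), the following is a minimal sketch in Python with NumPy, assuming the parameters $\mu_1$, $\mu_2$ and $\Sigma$ are known. The function names and the toy numbers are hypothetical and are not part of the paper.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal distribution function Phi(z), computed from erfc."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def optimal_rule(x, mu1, mu2, sigma):
    """Return 1 if delta' Sigma^{-1} (x - mu_bar) >= 0, otherwise 2."""
    delta = mu1 - mu2
    mu_bar = (mu1 + mu2) / 2.0
    score = delta @ np.linalg.solve(sigma, x - mu_bar)
    return 1 if score >= 0 else 2

def optimal_rate(mu1, mu2, sigma):
    """R_OPT = Phi(-Delta_p / 2) with Delta_p = sqrt(delta' Sigma^{-1} delta)."""
    delta = mu1 - mu2
    delta_p = math.sqrt(delta @ np.linalg.solve(sigma, delta))
    return normal_cdf(-delta_p / 2.0)

# Toy example (hypothetical numbers): identity covariance, means that
# differ by 2 in the first coordinate, so Delta_p = 2 and R_OPT = Phi(-1).
mu1 = np.array([1.0, 0.0, 0.0])
mu2 = np.array([-1.0, 0.0, 0.0])
sigma = np.eye(3)
r_opt = optimal_rate(mu1, mu2, sigma)
```

In this toy setting only the first coordinate of $x$ affects the decision, because the other components of $\delta$ are zero and $\Sigma$ is diagonal.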

In practice, $\mu_k$ and $\Sigma$ are typically unknown, and we have a training
sample $X = \{x_{ki}, i = 1, \ldots, n_k, k = 1, 2\}$, where $n_k$ is the sample size for class $k$,
$x_{ki} \sim N_p(\mu_k, \Sigma)$, $k = 1, 2$, all $x_{ki}$'s are independent and $X$ is independent
of $x$ to be classified. The limiting process considered in this paper is the
one with $n = n_1 + n_2 \to \infty$. We assume that $n_1/n$ converges to a constant
strictly between 0 and 1; $p$ is a function of $n$, but the subscript $n$ is omitted
for simplicity. When $n \to \infty$, $p$ may diverge to $\infty$, and the limit of $p/n$ may
be 0, a positive constant, or $\infty$.

For a classification rule $T$ constructed using the training sample, its
performance can be assessed by the conditional misclassification rate $R_T(X)$
defined as the average of the conditional probabilities of making two types
of misclassification, where the conditional probabilities are with respect to $x$,
given the training sample $X$. The unconditional misclassification rate is
$R_T = E[R_T(X)]$. The asymptotic performance of $T$ refers to the limiting
behavior of $R_T(X)$ or $R_T$ as $n \to \infty$. Since $0 \leq R_T(X) \leq 1$, by the
dominated convergence theorem, if $R_T(X) \to_P c$, where $c$ is a constant and $\to_P$
denotes convergence in probability, then $R_T \to c$. Hence, in this paper we
focus on the limiting behavior of the conditional misclassification rate $R_T(X)$.

We hope to find a rule $T$ such that $R_T(X)$ converges in probability to
the same limit as $R_{\mathrm{OPT}}$, the misclassification rate of the optimal rule. If
$R_{\mathrm{OPT}} \to 0$, however, we hope not only that $R_T(X) \to_P 0$, but also that $R_T(X)$ and
$R_{\mathrm{OPT}}$ have the same convergence rate. This leads to the following definition.

Definition 1. Let $T$ be a classification rule with conditional
misclassification rate $R_T(X)$, given the training sample $X$.

(i) $T$ is asymptotically optimal if $R_T(X)/R_{\mathrm{OPT}} \to_P 1$.

(ii) $T$ is asymptotically sub-optimal if $R_T(X) - R_{\mathrm{OPT}} \to_P 0$.

(iii) $T$ is asymptotically worst if $R_T(X) \to_P 1/2$.

If $\lim_{n\to\infty} R_{\mathrm{OPT}} > 0$ [i.e., $\Delta_p$ in (1) is bounded], then the asymptotic
sub-optimality is the same as the asymptotic optimality. Part (iii) of Definition 1
comes from the fact that 1/2 is the misclassification rate of random guessing.

In this paper we focus on the classification rules of the form

$$\text{classifying } x \text{ to class 1 if and only if } \hat\delta'\hat\Sigma^{-1}(x - \hat\mu) \geq 0, \tag{4}$$

where $\hat\delta$, $\hat\mu$ and $\hat\Sigma^{-1}$ are estimators of $\delta$, $\bar\mu$ and $\Sigma^{-1}$, respectively,
constructed using the training sample $X$.

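The well-known linear discriminant analysis (LDA) uses the maximum
likelihood estimators $\bar x_1$, $\bar x_2$ and $S$, where

$$\bar x_k = \frac{1}{n_k}\sum_{i=1}^{n_k} x_{ki}, \quad k = 1, 2, \qquad
S = \frac{1}{n}\sum_{k=1}^{2}\sum_{i=1}^{n_k} (x_{ki} - \bar x_k)(x_{ki} - \bar x_k)'.$$

The LDA is given by (4) with $\hat\delta = \bar x_1 - \bar x_2$, $\hat\mu = \bar x = (\bar x_1 + \bar x_2)/2$,
$\hat\Sigma^{-1} = S^{-1}$ when $S^{-1}$ exists, and $\hat\Sigma^{-1} =$ a generalized inverse $S^-$ when
$S^{-1}$ does not exist (e.g., when $p > n$). A straightforward calculation shows that, given $X$,
the conditional misclassification rate of the LDA is

$$R_{\mathrm{LDA}}(X) = \frac{1}{2}\sum_{k=1}^{2}
\Phi\!\left(\frac{(-1)^k\,\hat\delta' S^{-1}(\mu_k - \bar x_k) - \hat\delta' S^{-1}\hat\delta/2}
{\sqrt{\hat\delta' S^{-1}\Sigma S^{-1}\hat\delta}}\right). \tag{5}$$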
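
The LDA estimators and the conditional misclassification rate above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the Moore-Penrose pseudoinverse is used as one possible choice of generalized inverse, which is an assumption and not something the text prescribes.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def lda_fit(x1, x2):
    """x1 is (n1, p), x2 is (n2, p); rows are training vectors.

    Returns (delta_hat, mu_hat, s_inv, xbar1, xbar2).
    """
    n = x1.shape[0] + x2.shape[0]
    xbar1, xbar2 = x1.mean(axis=0), x2.mean(axis=0)
    r1, r2 = x1 - xbar1, x2 - xbar2
    s = (r1.T @ r1 + r2.T @ r2) / n  # maximum likelihood estimator, divisor n
    # pinv equals the ordinary inverse when S is nonsingular; for p > n it is
    # one generalized inverse (assumption: the text does not fix a choice).
    s_inv = np.linalg.pinv(s)
    return xbar1 - xbar2, (xbar1 + xbar2) / 2.0, s_inv, xbar1, xbar2

def lda_conditional_rate(fit, mu1, mu2, sigma):
    """Conditional misclassification rate given the training sample,
    evaluated with the true (mu1, mu2, Sigma), as in a simulation."""
    delta_hat, _, s_inv, xbar1, xbar2 = fit
    w = s_inv @ delta_hat                 # S^{-1} delta_hat (s_inv is symmetric)
    scale = math.sqrt(w @ sigma @ w)      # sqrt(delta_hat' S^{-1} Sigma S^{-1} delta_hat)
    half = (delta_hat @ w) / 2.0          # delta_hat' S^{-1} delta_hat / 2
    total = 0.0
    for k, (mu_k, xbar_k) in enumerate([(mu1, xbar1), (mu2, xbar2)], start=1):
        numerator = (-1) ** k * (w @ (mu_k - xbar_k)) - half
        total += normal_cdf(numerator / scale)
    return total / 2.0
```

If the estimators are replaced by the true parameters, the expression reduces to $\Phi(-\Delta_p/2)$, which is a convenient sanity check for an implementation.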

Is the LDA asymptotically optimal or sub-optimal according to
Definition 1? Bickel and Levina [(2004), Theorem 1] showed that, if $p > n$ and
$p/n \to \infty$, then the unconditional misclassification rate of the LDA
converges to 1/2 so that the LDA is asymptotically worst. A natural question
is, for what kind of $p$ (which may diverge to $\infty$) is the LDA asymptotically
optimal or sub-optimal? The following result provides an answer.

Theorem 1. Suppose that (2)-(3) hold and $s_n = p\sqrt{\log p}/\sqrt{n} \to 0$.

(i) The conditional misclassification rate of the LDA is equal to

$$R_{\mathrm{LDA}}(X) = \Phi(-[1 + O_P(s_n)]\Delta_p/2).$$

(ii) If $\Delta_p$ is bounded, then the LDA is asymptotically optimal and

$$\frac{R_{\mathrm{LDA}}(X)}{R_{\mathrm{OPT}}} - 1 = O_P(s_n).$$

(iii) If $\Delta_p \to \infty$, then the LDA is asymptotically sub-optimal.

(iv) If $\Delta_p \to \infty$ and $s_n\Delta_p^2 = (p\sqrt{\log p}/\sqrt{n})\Delta_p^2 \to 0$, then the LDA is
asymptotically optimal.

Remark 1. Since $\Delta_p \not\to 0$ under conditions (2) and (3), when $\Delta_p$ is
bounded, $s_n\Delta_p^2 \to 0$ is the same as $s_n \to 0$, which is satisfied if $p = O(n^\lambda)$
with $0 \leq \lambda < 1/2$. When $\Delta_p \to \infty$, $s_n\Delta_p^2 \to 0$ is stronger than $s_n \to 0$. Under
(2)-(3), $\Delta_p^2 = O(p)$. Hence, the extreme case is that $\Delta_p^2$ is a constant times $p$,
and the condition in part (iv) becomes $p^2\sqrt{\log p}/\sqrt{n} \to 0$, which holds when
$p = O(n^\lambda)$ with $0 \leq \lambda < 1/4$. In the traditional applications with a fixed $p$,
$\Delta_p$ is bounded, $s_n \to 0$ as $n \to \infty$ and thus Theorem 1 proves that the LDA
is asymptotically optimal.

The proof of part (iv) of Theorem 1 (see Section 6) utilizes the following 
lemma, which is also used in the proofs of other results in this paper. 

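Lemma 1. Let $\xi_n$ and $\tau_n$ be two sequences of positive numbers such that
$\xi_n \to \infty$ and $\tau_n \to 0$ as $n \to \infty$. If $\lim_{n\to\infty} \tau_n\xi_n = \gamma$, where $\gamma$ may be 0,
positive, or $\infty$, then

$$\lim_{n\to\infty} \frac{\Phi(-\sqrt{\xi_n}(1 - \tau_n))}{\Phi(-\sqrt{\xi_n})} = e^{\gamma}.$$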
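
A quick numerical illustration of Lemma 1 may help fix ideas. The sketch below is not from the paper: it takes the hypothetical choice $\tau_n = \gamma/\xi_n$ with $\gamma = 1$, so that $\tau_n\xi_n = \gamma$ exactly, and evaluates the ratio for a few moderately large values of $\xi_n$ (kept small enough that the normal tail does not underflow in double precision).

```python
import math

def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def lemma1_ratio(xi, tau):
    """Phi(-sqrt(xi) * (1 - tau)) / Phi(-sqrt(xi))."""
    root = math.sqrt(xi)
    return normal_cdf(-root * (1.0 - tau)) / normal_cdf(-root)

gamma = 1.0
# tau = gamma / xi, so tau * xi = gamma; the ratios should approach e^gamma.
ratios = [lemma1_ratio(xi, gamma / xi) for xi in (100.0, 400.0, 900.0)]
```

The lemma explains why a relative error $\tau_n$ in the argument of $\Phi$ is harmless for the ratio of misclassification rates only when $\tau_n\xi_n \to 0$.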

Since the LDA uses $S^-$ to estimate $\Sigma^{-1}$ when $p > n$ and is asymptotically
worst as Bickel and Levina (2004) showed, one may think that the bad
performance of the LDA is caused by the fact that $S^-$ is not a good estimator
of $\Sigma^{-1}$. Our following result shows that the LDA may still be asymptotically
worst even if we can estimate $\Sigma^{-1}$ perfectly.

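Theorem 2. Suppose that (2)-(3) hold, $p/n \to \infty$ and that $\Sigma$ is known
so that the LDA is given by (4) with $\hat\Sigma^{-1} = \Sigma^{-1}$, $\hat\delta = \bar x_1 - \bar x_2$ and $\hat\mu = \bar x$.

(i) If $\Delta_p^2/\sqrt{p/n} \to 0$ (which is true if $\Delta_p \not\to \infty$), then $R_{\mathrm{LDA}}(X) \to_P 1/2$.

(ii) If $\Delta_p^2/\sqrt{p/n} \to c$ with $0 < c < \infty$, then $R_{\mathrm{LDA}}(X) \to_P$ a constant
strictly between 0 and 1/2 and $R_{\mathrm{LDA}}(X)/R_{\mathrm{OPT}} \to_P \infty$.

(iii) If $\Delta_p^2/\sqrt{p/n} \to \infty$, then $R_{\mathrm{LDA}}(X) \to_P 0$ but $R_{\mathrm{LDA}}(X)/R_{\mathrm{OPT}} \to_P \infty$.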

Theorem 2 shows that even if $\Sigma$ is known, the LDA may be
asymptotically worst and the best we can hope is that the LDA is asymptotically
sub-optimal. It can also be shown that, when $\mu_1$ and $\mu_2$ are known and we
apply the LDA with $\hat\delta = \delta$ and $\hat\mu = (\mu_1 + \mu_2)/2$, the LDA is still not
asymptotically optimal when $\|\delta\|^2 - \|\delta_n\|^2 \not\to 0$, where $\delta_n$ is any sub-vector of $\delta$
with dimension $n$. This indicates that, in order to obtain an asymptotically
optimal classification rule when $p$ is much larger than $n$, we need sparsity
conditions on $\Sigma$ and $\delta$ when both of them are unknown. For bounded $\Delta_p$
(in which case the asymptotic optimality is the same as the asymptotic
sub-optimality), by imposing sparsity conditions on $\Sigma$, $\mu_1$ and $\mu_2$, Theorem 2
of Bickel and Levina (2004) shows the existence of an asymptotically
optimal classification rule. In the next section, we obtain a result by relaxing
the boundedness of $\Delta_p$ and by imposing sparsity conditions on $\Sigma$ and $\delta$.
Since the difference of the two normal distributions is in $\delta$, imposing a
sparsity condition on $\delta$ is weaker and more reasonable than imposing sparsity
conditions on both $\mu_1$ and $\mu_2$.

3. Sparse linear discriminant analysis. We focus on the situation where
the limit of $p/n$ is positive or $\infty$. The following sparsity measure on $\Sigma$ is
considered in Bickel and Levina (2008):

$$C_{h,p} = \max_{j \leq p} \sum_{l=1}^{p} |\sigma_{jl}|^h, \tag{6}$$

where $\sigma_{jl}$ is the $(j,l)$th element of $\Sigma$, $h$ is a constant not depending on $p$,
$0 \leq h < 1$ and $0^0$ is defined to be 0. In the special case of $h = 0$, $C_{0,p}$ in (6)
is the maximum of the numbers of nonzero elements of rows of $\Sigma$ so that
a $C_{0,p}$ much smaller than $p$ implies many elements of $\Sigma$ are equal to 0. If
$C_{h,p}$ is much smaller than $p$ for a constant $h \in (0,1)$, then $\Sigma$ is sparse in the
sense that many elements of $\Sigma$ are very small. An example of $C_{h,p}$ much
smaller than $p$ is $C_{h,p} = O(1)$ or $C_{h,p} = O(\log p)$.
Under conditions (2) and

$$\frac{\log p}{n} \to 0, \tag{7}$$

Bickel and Levina (2008) showed that

$$\|\tilde\Sigma - \Sigma\| = O_P(d_n) \quad\text{and}\quad \|\tilde\Sigma^{-1} - \Sigma^{-1}\| = O_P(d_n), \tag{8}$$

where $d_n = C_{h,p}(n^{-1}\log p)^{(1-h)/2}$, $\tilde\Sigma$ is $S$ thresholded at $t_n = M_1\sqrt{\log p}/\sqrt{n}$
with a positive constant $M_1$; that is, the $(j,l)$th element of $\tilde\Sigma$ is
$\hat\sigma_{jl} I(|\hat\sigma_{jl}| > t_n)$, $\hat\sigma_{jl}$ is the $(j,l)$th element of $S$ and $I(A)$ is the
indicator function of the set $A$. We consider a slight modification, that is, only
off-diagonal elements of $S$ are thresholded. The resulting estimator is still denoted
by $\tilde\Sigma$ and it has property (8) under conditions (2) and (7).
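
The modified thresholding estimator, in which only off-diagonal elements of $S$ are thresholded at $t_n = M_1\sqrt{\log p}/\sqrt{n}$, can be sketched as follows. The function name `threshold_covariance` and the example matrix are hypothetical.

```python
import numpy as np

def threshold_covariance(s, n, m1):
    """Set off-diagonal entries of S with |s_jl| <= t_n to zero, where
    t_n = m1 * sqrt(log p / n); diagonal entries are always kept."""
    p = s.shape[0]
    t_n = m1 * np.sqrt(np.log(p) / n)
    sigma_tilde = np.where(np.abs(s) > t_n, s, 0.0)
    np.fill_diagonal(sigma_tilde, np.diag(s))  # restore the untouched diagonal
    return sigma_tilde

# Hypothetical 3 x 3 sample covariance with n = 100 and M1 = 1,
# so that t_n = sqrt(log(3) / 100), roughly 0.105.
s = np.array([[1.00, 0.05, 0.60],
              [0.05, 1.00, 0.02],
              [0.60, 0.02, 0.01]])
sigma_tilde = threshold_covariance(s, n=100, m1=1.0)
```

Note that in finite samples a thresholded matrix is not guaranteed to be positive definite, so an implementation may need to handle a singular or indefinite $\tilde\Sigma$ when inverting it.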

We now turn to the sparsity of $\delta$. On one hand, a large $\Delta_p$ results in
a large difference between $N_p(\mu_1, \Sigma)$ and $N_p(\mu_2, \Sigma)$ so that the optimal rule
has a small misclassification rate. On the other hand, a larger divergence rate
of $\Delta_p$ results in a more difficult task of constructing a good classification
rule, since $\delta$ has to be estimated based on the training sample $X$ of a size
much smaller than $p$. We consider the following sparsity measure on $\delta$ that
is similar to the sparsity measure $C_{h,p}$ on $\Sigma$:

$$D_{g,p} = \sum_{j=1}^{p} \delta_j^{2g}, \tag{9}$$

where $\delta_j$ is the $j$th component of $\delta$, $g$ is a constant not depending on $p$ and
$0 \leq g < 1$. If $D_{g,p}$ is much smaller than $p$ for a $g \in [0,1)$, then $\delta$ is sparse.
For $\Delta_p$ defined in (1), under (2)-(3), $\Delta_p^2 \leq c_0\|\delta\|^2 \leq c_0^{1+2(1-g)} D_{g,p}$. Hence,
the rate of divergence of $\Delta_p^2$ is always smaller than that of $D_{g,p}$ and, in
particular, $\Delta_p$ is bounded when $D_{g,p}$ is bounded for a $g \in [0,1)$.

We consider the sparse estimator $\tilde\delta$ that is $\hat\delta$ thresholded at

$$a_n = M_2\left(\frac{\log p}{n}\right)^{\alpha} \tag{10}$$

with constants $M_2 > 0$ and $\alpha \in (0, 1/2)$, that is, the $j$th component of $\tilde\delta$ is
$\hat\delta_j I(|\hat\delta_j| > a_n)$, where $\hat\delta_j$ is the $j$th component of $\hat\delta$. The following result is
useful.

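Lemma 2. Let $\delta_j$ be the $j$th component of $\delta$, $\hat\delta_j$ be the $j$th component
of $\hat\delta$, $a_n$ be given by (10) and $r > 1$ be a fixed constant.

(i) If (7) holds, then

$$P\Bigl(\bigcap_{1 \leq j \leq p,\ |\delta_j| \leq a_n/r} \{|\hat\delta_j| \leq a_n\}\Bigr) \to 1 \tag{11}$$

and

$$P\Bigl(\bigcap_{1 \leq j \leq p,\ |\delta_j| > r a_n} \{|\hat\delta_j| > a_n\}\Bigr) \to 1. \tag{12}$$

(ii) Let $q_{n0}$ = the number of $j$'s with $|\delta_j| > r a_n$, $q_n$ = the number of $j$'s
with $|\delta_j| > a_n/r$ and $\hat q$ = the number of $j$'s with $|\hat\delta_j| > a_n$. If (7) holds, then

$$P(q_{n0} \leq \hat q \leq q_n) \to 1.$$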
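
The thresholding of $\hat\delta$ at $a_n$ in (10), together with the count $\hat q$ from Lemma 2(ii), can be sketched as below. The function name and the numbers in the example are hypothetical.

```python
import numpy as np

def threshold_mean_difference(delta_hat, n, m2, alpha):
    """Return (delta_tilde, q_hat, a_n), where a_n = m2 * (log p / n) ** alpha,
    delta_tilde keeps only the components with |delta_hat_j| > a_n, and
    q_hat is the number of components kept."""
    p = delta_hat.shape[0]
    a_n = m2 * (np.log(p) / n) ** alpha
    keep = np.abs(delta_hat) > a_n
    return np.where(keep, delta_hat, 0.0), int(keep.sum()), a_n

# Hypothetical example: p = 4, n = 50, M2 = 1, alpha = 0.3 gives a_n of about 0.34.
delta_hat = np.array([2.0, 0.1, -0.5, 0.0])
delta_tilde, q_hat, a_n = threshold_mean_difference(delta_hat, n=50, m2=1.0, alpha=0.3)
```

Lemma 2 says that, with probability tending to one, this count $\hat q$ is sandwiched between the numbers of true components $\delta_j$ exceeding $r a_n$ and $a_n/r$ in absolute value.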



We propose a sparse linear discriminant analysis (SLDA) for
high-dimension $p$, which is given by (4) with $\hat\delta = \tilde\delta$, $\hat\Sigma = \tilde\Sigma$ and $\hat\mu = \bar x$. The following
result establishes the asymptotic optimality of the SLDA under some
conditions on the rate of divergence of $p$, $C_{h,p}$, $D_{g,p}$, $q_n$ and $\Delta_p$.

Theorem 3. Let $C_{h,p}$ be given by (6), $D_{g,p}$ be given by (9), $a_n$ be given
by (10), $q_n$ be as defined in Lemma 2 and $d_n = C_{h,p}(n^{-1}\log p)^{(1-h)/2}$.
Assume that conditions (2), (3) and (7) hold and

$$b_n = \max\left\{ d_n,\ \frac{a_n^{1-g}\sqrt{D_{g,p}}}{\Delta_p},\ \frac{\sqrt{C_{h,p}\,q_n}}{\Delta_p\sqrt{n}} \right\} \to 0. \tag{13}$$

(i) The conditional misclassification rate of the SLDA is equal to

$$R_{\mathrm{SLDA}}(X) = \Phi(-[1 + O_P(b_n)]\Delta_p/2).$$

(ii) If $\Delta_p$ is bounded, then the SLDA is asymptotically optimal and

$$\frac{R_{\mathrm{SLDA}}(X)}{R_{\mathrm{OPT}}} - 1 = O_P(b_n).$$

(iii) If $\Delta_p \to \infty$, then the SLDA is asymptotically sub-optimal.

(iv) If $\Delta_p \to \infty$ and $b_n\Delta_p^2 \to 0$, then the SLDA is asymptotically optimal.
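
Putting the pieces together, the SLDA rule (rule (4) with the thresholded estimators) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the default `alpha=0.3` is only an example value inside $(0, 1/2)$, and the pseudoinverse is used as a guard in case the thresholded matrix is singular in a finite sample.

```python
import numpy as np

def slda_fit(x1, x2, m1, m2, alpha=0.3):
    """Fit the two-class SLDA sketch. x1 is (n1, p), x2 is (n2, p).

    Returns (w, xbar) with w = Sigma_tilde^{-1} delta_tilde and
    xbar = (xbar1 + xbar2) / 2.
    """
    n1, n2 = x1.shape[0], x2.shape[0]
    n, p = n1 + n2, x1.shape[1]
    xbar1, xbar2 = x1.mean(axis=0), x2.mean(axis=0)
    r1, r2 = x1 - xbar1, x2 - xbar2
    s = (r1.T @ r1 + r2.T @ r2) / n

    # Threshold off-diagonal elements of S at t_n = M1 * sqrt(log p / n).
    t_n = m1 * np.sqrt(np.log(p) / n)
    sigma_tilde = np.where(np.abs(s) > t_n, s, 0.0)
    np.fill_diagonal(sigma_tilde, np.diag(s))

    # Threshold the mean difference at a_n = M2 * (log p / n) ** alpha.
    a_n = m2 * (np.log(p) / n) ** alpha
    delta_hat = xbar1 - xbar2
    delta_tilde = np.where(np.abs(delta_hat) > a_n, delta_hat, 0.0)

    # Assumption: pinv instead of a plain inverse, as a numerical safeguard.
    w = np.linalg.pinv(sigma_tilde) @ delta_tilde
    return w, (xbar1 + xbar2) / 2.0

def slda_predict(model, x):
    """Classify each row of x: class 1 if delta_tilde' Sigma_tilde^{-1} (x - xbar) >= 0."""
    w, xbar = model
    scores = (np.atleast_2d(x) - xbar) @ w
    return np.where(scores >= 0, 1, 2)
```

A call such as `slda_predict(slda_fit(x1, x2, m1=2.0, m2=1.0), x_new)` returns an array of class labels in {1, 2}; the constants `m1` and `m2` here are placeholders that would normally be tuned from the data.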

Remark 2. Condition (13) may be achieved by an appropriate choice
of $\alpha$ in $a_n$, given the divergence rates of $C_{h,p}$, $D_{g,p}$, $q_n$ and $\Delta_p$.

Remark 3. When $\Delta_p$ is bounded and (2)-(3) hold, condition (13) is
the same as

$$d_n \to 0, \quad D_{g,p}\,a_n^{2(1-g)} \to 0 \quad\text{and}\quad C_{h,p}\,q_n/n \to 0. \tag{14}$$

Remark 4. When $\Delta_p \to \infty$, condition (13), which is sufficient for the
asymptotic sub-optimality of the SLDA, is implied by $d_n \to 0$, $D_{g,p}\,a_n^{2(1-g)} = O(1)$
and $C_{h,p}\,q_n/n = O(1)$. When $\Delta_p \to \infty$, the condition $b_n\Delta_p^2 \to 0$, which
is sufficient for the asymptotic optimality of the SLDA, is the same as

$$\Delta_p^2 d_n \to 0, \quad \Delta_p^2 D_{g,p}\,a_n^{2(1-g)} \to 0 \quad\text{and}\quad \Delta_p^2 C_{h,p}\,q_n/n \to 0. \tag{15}$$

We now study when condition (13) holds and when $b_n\Delta_p^2 \to 0$ with
$\Delta_p \to \infty$. By Remarks 3 and 4, (13) is the same as condition (14) when $\Delta_p$ is
bounded, and $b_n\Delta_p^2 \to 0$ is the same as condition (15) when $\Delta_p \to \infty$.

1. If there are two constants $c_1$ and $c_2$ such that $0 < c_1 \leq |\delta_j| \leq c_2$ for any
   nonzero $\delta_j$, then $q_n$ is exactly the number of nonzero $\delta_j$'s. Under
   condition (3), $\Delta_p^2$ and $D_{0,p}$ have exactly the order $q_n$.

   (a) If $q_n$ is bounded (e.g., there are only finitely many nonzero $\delta_j$'s),
       then $\Delta_p$ is bounded and condition (13) is the same as condition (14).
       The last two convergence requirements in (14) are implied by
       $d_n = C_{h,p}(n^{-1}\log p)^{(1-h)/2} \to 0$, which is the condition for the
       consistency of $\tilde\Sigma$ proposed by Bickel and Levina (2008).

   (b) When $q_n \to \infty$ ($\Delta_p \to \infty$), we assume that $q_n = O(n^\eta)$ and
       $C_{h,p} = O(n^\gamma)$ with $\eta \in (0,1)$ and $\gamma \in [0,1)$. Then, condition (15) is implied by

       $$n^{\eta+\gamma}(n^{-1}\log p)^{(1-h)/2} \to 0, \quad n^{2\eta}(n^{-1}\log p)^{2\alpha} \to 0, \quad n^{2\eta+\gamma-1} \to 0. \tag{16}$$

       If we choose $\alpha = (1-h)/4$, then condition (16) holds when $2\eta + \gamma < 1$
       and $n^{\eta+\gamma}(n^{-1}\log p)^{(1-h)/2} \to 0$. To achieve (16) we need to know the
       divergence rate of $p$. If $p = O(n^\kappa)$ for a $\kappa \geq 1$, then
       $(n^{-1}\log p)^{(1-h)/2} = O((n^{-1}\log n)^{(1-h)/2})$, and thus condition (16) holds when
       $\eta + \gamma < (1-h)/2$ and $\eta < (1+h)/2$. If $p = O(e^{n^\beta})$ for a $\beta \in (0,1)$, which
       is referred to as an ultra-high dimension, then
       $(n^{-1}\log p)^{(1-h)/2} = O(n^{(\beta-1)(1-h)/2})$, and condition (16) holds if
       $\eta + \gamma < (1-h)(1-\beta)/2$ and $\eta < 1 - (1-h)(1-\beta)/2$.
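2. Since

   $$\Delta_p^2 \geq c_0^{-1} \sum_{j:\,|\delta_j| > a_n/r} \delta_j^2 \geq c_0^{-1} q_n (a_n/r)^2$$

   and

   $$D_{g,p} \geq \sum_{j:\,|\delta_j| > a_n/r} \delta_j^{2g} \geq q_n (a_n/r)^{2g},$$

   we conclude that

   $$q_n = O\left(\min\left\{\frac{\Delta_p^2}{a_n^2},\ \frac{D_{g,p}}{a_n^{2g}}\right\}\right). \tag{17}$$

   The right-hand side of (17) can be used as a bound of the divergence rate
   of $q_n$ when $q_n \to \infty$, although it may not be a tight bound. For example,
   if $\Delta_p^2 = O(\log p)$ and the right-hand side of (17) is used as a bound for
   $q_n$, then the last convergence requirement in (14) or (15) is implied by
   the first convergence requirement in (14) or (15) when $\alpha \leq (1+h)/4$.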

3. If $D_{g,p} = O(C_{h,p})$, then the second convergence requirement in (14) or
   (15) is implied by the first convergence requirement in (14) or (15) when
   $\alpha \geq (1-h)/[4(1-g)]$.

4. Consider the case where $C_{h,p} = O(\log p)$, $D_{g,p} = O(\log p)$ and an
   ultra-high dimension, that is, $p = O(e^{n^\beta})$ for a $\beta \in (0,1)$. From the previous
   discussion, condition (14) holds if $d_n \to 0$, and (15) holds if $d_n\log p \to 0$.
   Since $\log p = O(n^\beta)$, $d_n = O(n^{\beta+(\beta-1)(1-h)/2})$, which converges to 0 if
   $\beta < (1-h)/(3-h)$. If $\Delta_p$ is bounded, then $d_n \to 0$ is sufficient for
   condition (13). If $\Delta_p \to \infty$, then the largest divergence rate of $\Delta_p^2$ is
   $O(\log p) = O(n^\beta)$ and $\Delta_p^2 d_n \to 0$ (i.e., the SLDA is asymptotically optimal) when
   $\beta < (1-h)/(5-h)$. When $h = 0$, this means $\beta < 1/5$.

5. If the divergence rate of $p$ is smaller than $O(e^{n^\beta})$ then we can afford to
   have a larger than $O(\log p)$ divergence rate for $C_{h,p}$ and $D_{g,p}$. For
   example, if $p = O(n^\kappa)$ for a $\kappa \geq 1$ and $\max\{C_{h,p}, D_{g,p}\} = cn^\gamma$ for a $\gamma \in (0,1)$
   and a positive constant $c$, then $\log p = O(\log n)$ diverges to $\infty$ at a rate
   slower than $n^\gamma$. We now study when condition (14) holds. First,
   $d_n = C_{h,p}(n^{-1}\log p)^{(1-h)/2} = O(n^{\gamma-(1-h)/2}(\log n)^{(1-h)/2})$, which converges to 0
   if $\gamma < (1-h)/2 \leq 1/2$. Second, $a_n^{2(1-g)}D_{g,p} = O(n^{\gamma-2(1-g)\alpha}(\log n)^{2(1-g)\alpha})$,
   which converges to 0 if $\alpha$ is chosen so that $\alpha > \gamma/[2(1-g)]$. Finally, if
   we use the right-hand side of (17) as a bound for $q_n$, then
   $C_{h,p}\,q_n/n = O(n^{2(1-g)\alpha+\gamma-1}/(\log n)^{2(1-g)\alpha})$, which converges to 0 if
   $\alpha < (1-\gamma)/[2(1-g)]$. Thus, condition (14) holds if $\gamma < (1-h)/2$ and
   $\gamma/[2(1-g)] < \alpha < (1-\gamma)/[2(1-g)]$. For condition (15), we assume that
   $\Delta_p^2 = O(n^{\rho\gamma})$ with a $\rho \in [0,1]$ ($\rho = 0$ corresponds to a bounded $\Delta_p$).
   Then, a similar analysis leads to the conclusion that condition (15) holds if
   $(1+\rho)\gamma < (1-h)/2$ and $(1+\rho)\gamma/[2(1-g)] < \alpha < [1-(1+\rho)\gamma]/[2(1-g)]$.
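
The quantities discussed in items 1-5 can be evaluated numerically for given (hypothetical) values of the sparsity measures, which may help when checking how the choice of $\alpha$ trades off the terms in (13). The sketch below is illustrative only; the function names and inputs are assumptions, and finite-sample values of $b_n$ do not by themselves say anything about the asymptotic conditions.

```python
import math

def slda_rates(n, p, c_hp, d_gp, q_n, delta_p, h, g, alpha, m2=1.0):
    """Return (d_n, a_n, b_n) as defined around (8), (10) and (13)."""
    ratio = math.log(p) / n
    d_n = c_hp * ratio ** ((1.0 - h) / 2.0)
    a_n = m2 * ratio ** alpha
    b_n = max(
        d_n,
        a_n ** (1.0 - g) * math.sqrt(d_gp) / delta_p,
        math.sqrt(c_hp * q_n) / (delta_p * math.sqrt(n)),
    )
    return d_n, a_n, b_n

def beta_bounds(h):
    """Item 4: upper limits on beta for d_n -> 0 and for Delta_p^2 d_n -> 0
    when C_{h,p}, D_{g,p} and Delta_p^2 are all O(log p) and p = O(exp(n^beta))."""
    return (1.0 - h) / (3.0 - h), (1.0 - h) / (5.0 - h)

# Hypothetical inputs, chosen only to exercise the formulas.
d_n, a_n, b_n = slda_rates(n=100, p=100, c_hp=1.0, d_gp=4.0, q_n=4,
                           delta_p=2.0, h=0.0, g=0.0, alpha=0.25)
```

For $h = 0$, `beta_bounds(0.0)` returns the limits 1/3 and 1/5 quoted in item 4.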

To apply the SLDA, we need to choose two constants, $M_1$ in the
thresholding estimator $\tilde\Sigma$ and $M_2$ in the thresholding estimator $\tilde\delta$. We suggest
a data-driven method via a cross-validation procedure. Let $X_{ki}$ be the data
set containing the entire training sample but with $x_{ki}$ deleted, and let $T_{ki}$
be the SLDA rule based on $X_{ki}$, $i = 1, \ldots, n_k$, $k = 1, 2$. The leave-one-out
cross-validation estimator of the misclassification rate of the SLDA is

$$\hat R_{\mathrm{SLDA}} = \frac{1}{n}\sum_{k=1}^{2}\sum_{i=1}^{n_k} r_{ki},$$

where $r_{ki}$ is the indicator function of whether $T_{ki}$ classifies $x_{ki}$ incorrectly.
Let $R(n_1, n_2)$ denote $R_{\mathrm{SLDA}}$ when the sample sizes are $n_1$ and $n_2$. Then

$$E(\hat R_{\mathrm{SLDA}}) = \frac{1}{n}\sum_{k=1}^{2}\sum_{i=1}^{n_k} E(r_{ki})
= \frac{n_1 R(n_1 - 1, n_2) + n_2 R(n_1, n_2 - 1)}{n},$$

which is close to $R(n_1, n_2) = R_{\mathrm{SLDA}}$ for large $n_k$. Let $\hat R_{\mathrm{SLDA}}(M_1, M_2)$ be
the cross-validation estimator when $(M_1, M_2)$ is used in thresholding $S$
and $\hat\delta$. Then, a data-driven method of selecting $(M_1, M_2)$ is to minimize
$\hat R_{\mathrm{SLDA}}(M_1, M_2)$ over a suitable range of $(M_1, M_2)$. The resulting $\hat R_{\mathrm{SLDA}}$ can
also be used as an estimate of the misclassification rate of the SLDA.
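
The leave-one-out procedure for selecting $(M_1, M_2)$ can be sketched as a plain grid search. This is a sketch under assumptions: the helper names, the grids, the default `alpha=0.3`, the use of the pseudoinverse, and the tie-breaking rule (first minimizer in grid order) are all hypothetical choices and are not taken from the paper.

```python
import numpy as np

def slda_fit(x1, x2, m1, m2, alpha=0.3):
    """Return (w, xbar) for the two-class SLDA sketch with thresholds
    t_n = m1 * sqrt(log p / n) and a_n = m2 * (log p / n) ** alpha."""
    n, p = x1.shape[0] + x2.shape[0], x1.shape[1]
    xbar1, xbar2 = x1.mean(axis=0), x2.mean(axis=0)
    r1, r2 = x1 - xbar1, x2 - xbar2
    s = (r1.T @ r1 + r2.T @ r2) / n
    sigma_tilde = np.where(np.abs(s) > m1 * np.sqrt(np.log(p) / n), s, 0.0)
    np.fill_diagonal(sigma_tilde, np.diag(s))
    delta_hat = xbar1 - xbar2
    a_n = m2 * (np.log(p) / n) ** alpha
    delta_tilde = np.where(np.abs(delta_hat) > a_n, delta_hat, 0.0)
    return np.linalg.pinv(sigma_tilde) @ delta_tilde, (xbar1 + xbar2) / 2.0

def slda_classify(model, x):
    """Return 1 or 2 for a single vector x."""
    w, xbar = model
    return 1 if (x - xbar) @ w >= 0 else 2

def loo_cv_error(x1, x2, m1, m2, alpha=0.3):
    """Leave-one-out estimate: fraction of training points misclassified
    by the rule fitted without that point."""
    errors = 0
    for i in range(x1.shape[0]):
        model = slda_fit(np.delete(x1, i, axis=0), x2, m1, m2, alpha)
        errors += int(slda_classify(model, x1[i]) != 1)
    for i in range(x2.shape[0]):
        model = slda_fit(x1, np.delete(x2, i, axis=0), m1, m2, alpha)
        errors += int(slda_classify(model, x2[i]) != 2)
    return errors / (x1.shape[0] + x2.shape[0])

def select_constants(x1, x2, m1_grid, m2_grid, alpha=0.3):
    """Grid search; returns (cv_error, m1, m2) for the first minimizer found."""
    best = None
    for m1 in m1_grid:
        for m2 in m2_grid:
            err = loo_cv_error(x1, x2, m1, m2, alpha)
            if best is None or err < best[0]:
                best = (err, m1, m2)
    return best
```

Each candidate pair requires $n$ refits, so the cost grows with the product of the two grid sizes; for large $p$ one would likely want to reuse the sample covariance computations across grid points rather than refit from scratch as this sketch does.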

4. Extensions. We first consider an extension of the main result in
Section 3 to nonnormal $x$ and $x_{ki}$'s. For nonnormal $x$, the LDA with known $\mu_k$
and $\Sigma$, that is, the rule classifying $x$ to class 1 if and only if
$\delta'\Sigma^{-1}(x - \bar\mu) \geq 0$, is still optimal when $x$ has an elliptical distribution
[see, e.g., Fang and Anderson (1990)] with density

$$c_p|\Sigma|^{-1/2} f((x - \mu)'\Sigma^{-1}(x - \mu)), \tag{18}$$

where $\mu$ is either $\mu_1$ or $\mu_2$, $f$ is a monotone function on $[0, \infty)$, and $c_p$ is
a normalizing constant. Special cases of (18) are the multivariate
$t$-distribution and the multivariate double-exponential distribution. Although this rule
is not necessarily optimal when the distribution of $x$ is not of the form (18),
it is still a reasonably good rule when $\mu_k$ and $\Sigma$ are known. Thus, when
$\mu_k$ and $\Sigma$ are unknown, we study whether the misclassification rate of the
SLDA defined in Section 3 is close to that of the LDA with known $\mu_k$ and $\Sigma$.

From the proofs for the asymptotic properties of the SLDA in Section 3, 
the results depending on the normality assumption are: 

(i) result (8), the consistency of $\tilde\Sigma$;

(ii) results (11) and (12) in Lemma 2; 

(iii) the form of the optimal misclassification rate given by (1); 

(iv) the result in Lemma 1. 

Thus, if we relax the normality assumption, we need to address (i)-(iv).
For (i), it was discussed in Section 2.3 of Bickel and Levina (2008) that
result (8) still holds when the normality assumption is replaced by one of
the following two conditions. The first condition is

$$\sup_{k,j} E\bigl(e^{t x_{kij}^2}\bigr) < \infty \quad \text{for all } |t| \leq t_0 \tag{19}$$

for a constant $t_0 > 0$, where $x_{kij}$ is the $j$th component of $x_{ki}$. Under
condition (19), result (8) holds without any modification. The second condition is

$$\sup_{k,j} E|x_{kij}|^{2\nu} < \infty \tag{20}$$

for a constant $\nu > 0$. Under condition (20), result (8) holds with $n^{-1}\log p$
changed to $n^{-1}p^{4/\nu}$. The same argument can be used to address (ii), that
is, results (11) and (12) hold under condition (19), or under condition (20) with
$n^{-1}\log p$ replaced by $n^{-1}p^{4/\nu}$. For (iii), the normality of $x$ can be relaxed to
that, for any $p$-dimensional nonrandom vector $l$ with $\|l\| = 1$ and any real
number $t$,

$$P(l'\Sigma^{-1/2}(x - \mu) \leq t) = \Psi(t), \tag{21}$$

where $\Psi$ is an unknown distribution function symmetric about 0 but it does
not depend on $l$. Distributions satisfying (21) include elliptical distributions
[e.g., a distribution of the form (18)] and the multivariate scale mixture
of normals [Fang and Anderson (1990)]. Under (21), when $\mu_k$ and $\Sigma$ are
known, the LDA has misclassification rate $\Psi(-\Delta_p/2)$ with $\Delta_p$ given by (1).
It remains to address (iv). Note that the following result,

$$\lim_{x\to\infty} \frac{x^{-1}e^{-x^2/2}}{\sqrt{2\pi}\,\Phi(-x)} = 1, \tag{22}$$

is the key for Lemma 1. Without assuming normality, we consider the
condition

$$0 < \lim_{x\to\infty} \frac{x^{\omega}e^{-cx^{\varphi}}}{\Psi(-x)} < \infty, \tag{23}$$

where $\varphi$ is a constant, $0 \leq \varphi \leq 2$, $\omega$ is a constant and $c$ is a positive constant.
For the case where $\Psi$ is standard normal, condition (23) holds with $\varphi = 2$,
$\omega = -1$ and $c = 1/2$. Under condition (23), we can show that the result in
Lemma 1 holds for the case of $\gamma = 0$, which is needed to extend the result in
Theorem 3(iv). This leads to the following extension.

Theorem 4. Assume condition (21) and either condition (19) or (20).
When condition (19) holds, let $b_n$ be defined by (13). When condition (20)
holds, let $a_n$ and $b_n$ be defined by (10) and (13), respectively, with $n^{-1}\log p$
replaced by $n^{-1}p^{4/\nu}$. Assume that $a_n \to 0$ and $b_n \to 0$.

(i) The conditional misclassification rate of the SLDA is

$$R_{\mathrm{SLDA}}(X) = \Psi(-[1 + O_P(b_n)]\Delta_p/2).$$

(ii) If $\Delta_p$ is bounded, then

$$\frac{R_{\mathrm{SLDA}}(X)}{\Psi(-\Delta_p/2)} - 1 = O_P(b_n),$$

where $\Psi(-\Delta_p/2)$ is the misclassification rate of the LDA when $\mu_k$ and $\Sigma$
are known.

(iii) If $\Delta_p \to \infty$, then $R_{\mathrm{SLDA}}(X) \to_P 0$.

(iv) If $\Delta_p \to \infty$ and $b_n\Delta_p^2 \to 0$, then

$$\frac{R_{\mathrm{SLDA}}(X)}{\Psi(-\Delta_p/2)} \to_P 1.$$



We next consider extending the results in Sections 2 and 3 to the
classification problem with $K \geq 3$ classes. Let $x$ be a $p$-dimensional normal random
vector belonging to class $k$ if $x \sim N_p(\mu_k, \Sigma)$, $k = 1, \ldots, K$, and the training
sample be $X = \{x_{ki}, i = 1, \ldots, n_k, k = 1, \ldots, K\}$, where $n_k$ is the sample size
for class $k$, $x_{ki} \sim N_p(\mu_k, \Sigma)$, $k = 1, \ldots, K$, and all $x_{ki}$'s are independent.
The LDA classifies $x$ to class $k$ if and only if $\hat\delta_{kl}'S^{-1}(x - \hat\mu_{kl}) \geq 0$ for all
$l \neq k$, $l = 1, \ldots, K$, where $\hat\delta_{kl} = \bar x_k - \bar x_l$, $\hat\mu_{kl} = (\bar x_k + \bar x_l)/2$,
$\bar x_k = n_k^{-1}\sum_{i=1}^{n_k} x_{ki}$ and $S^{-1}$ is an inverse or a generalized inverse of
$S = n^{-1}\sum_{k=1}^{K}\sum_{i=1}^{n_k}(x_{ki} - \bar x_k)(x_{ki} - \bar x_k)'$, and $n = n_1 + \cdots + n_K$.
The conditional misclassification rate of the LDA is

$$\frac{1}{K}\sum_{k=1}^{K}\sum_{j \neq k} P_k\bigl(\hat\delta_{jl}'S^{-1}(x - \hat\mu_{jl}) \geq 0,\ l \neq j\bigr),$$






where $P_k$ is the probability with respect to $x \sim N_p(\mu_k, \Sigma)$, $k = 1, \ldots, K$. The
SLDA and its conditional misclassification rate can be obtained by simply
replacing $S$ and $\hat\delta_{kl}$ by their thresholding estimators $\tilde\Sigma$ and $\tilde\delta_{kl}$, respectively.
For simplicity of computation, we suggest the use of the same thresholding
constant (10) for all $\hat\delta_{kl}$'s.
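
The pairwise rule for $K \geq 3$ classes, with thresholded estimators and a common threshold $a_n$ for all pairwise mean differences, can be sketched as follows. The function names, the default `alpha=0.3`, the pseudoinverse, and the convention of returning `None` when no class wins every pairwise comparison are assumptions of this sketch and are not specified in the text.

```python
import numpy as np

def multiclass_slda_fit(samples, m1, m2, alpha=0.3):
    """samples is a list of (n_k, p) arrays, one per class.

    Returns (means, sigma_inv, a_n).
    """
    n = sum(x.shape[0] for x in samples)
    p = samples[0].shape[1]
    means = [x.mean(axis=0) for x in samples]
    s = sum((x - m).T @ (x - m) for x, m in zip(samples, means)) / n
    t_n = m1 * np.sqrt(np.log(p) / n)
    sigma_tilde = np.where(np.abs(s) > t_n, s, 0.0)
    np.fill_diagonal(sigma_tilde, np.diag(s))   # only off-diagonals are thresholded
    a_n = m2 * (np.log(p) / n) ** alpha         # same threshold for every pair (k, l)
    return means, np.linalg.pinv(sigma_tilde), a_n

def multiclass_slda_classify(model, x):
    """Return the 1-based label k such that
    delta_tilde_kl' Sigma_tilde^{-1} (x - mu_hat_kl) >= 0 for all l != k,
    or None if no class satisfies all pairwise comparisons."""
    means, sigma_inv, a_n = model
    for k in range(len(means)):
        wins_all = True
        for l in range(len(means)):
            if l == k:
                continue
            delta = means[k] - means[l]
            delta = np.where(np.abs(delta) > a_n, delta, 0.0)
            midpoint = (means[k] + means[l]) / 2.0
            if delta @ sigma_inv @ (x - midpoint) < 0:
                wins_all = False
                break
        if wins_all:
            return k + 1
    return None
```

Because each pairwise difference is thresholded separately, the pairwise comparisons need not be mutually consistent in a finite sample, which is why this sketch allows for the case where no class wins all of them.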

The optimal rate can be calculated as

\[
R_{\mathrm{OPT}} = \frac{1}{K} \sum_{k=1}^{K} \sum_{j \neq k} P_k\bigl( \delta_{jl}' \Sigma^{-1} (x - \mu_{jl}) \geq 0,\ l \neq j \bigr), \tag{24}
\]

where $\delta_{jl} = \mu_j - \mu_l$ and $\mu_{jl} = (\mu_j + \mu_l)/2$, $j, l = 1, \ldots, K$, $j \neq l$. Asymptotic
properties of the LDA and SLDA can be obtained under the asymptotic
setting with $n \to \infty$ and $n_k/n \to$ a constant in $(0, 1)$ for each $k$. Sparsity
conditions should be imposed on each $\delta_{kl}$. If the probabilities in expression
(24) do not converge to 0, then the asymptotic optimality of the LDA
(under the conditions in Theorem 1) or the SLDA (under the conditions in
Theorem 3) can be established using the same proofs as those in Section 6.
When $R_{\mathrm{OPT}}$ in (24) converges to 0, to consider convergence rates, the proof
of the asymptotic optimality of the LDA or SLDA requires an extension of
Lemma 1. Specifically, we need an extension of result (22) to the case of
multivariate normal distributions. This technical issue, together with
empirical properties of the SLDA with $K \geq 3$, will be investigated in our future
research.
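
The pairwise rule above can be illustrated with a short numerical sketch. This is not the authors' code: the function names `fit_pairwise_lda`, `classify` and `pairwise_margins` are hypothetical, and the Moore-Penrose pseudoinverse is assumed as one possible choice of the generalized inverse $S^{-}$. Because the pseudoinverse is symmetric, each quantity $\hat\delta_{kl}' S^{-}(x - \hat\mu_{kl})$ equals a difference of two linear scores, so the rule reduces to picking the class with the largest score.

```python
import numpy as np

def fit_pairwise_lda(X, y):
    """Return class labels, class means, and a generalized inverse of the pooled S."""
    classes = np.unique(y)
    means = np.vstack([X[y == k].mean(axis=0) for k in classes])
    centered = X - means[np.searchsorted(classes, y)]
    S = centered.T @ centered / X.shape[0]   # S = n^{-1} sum_k sum_i (x_ki - xbar_k)(x_ki - xbar_k)'
    return classes, means, np.linalg.pinv(S)  # assumption: pseudoinverse plays the role of S^-

def pairwise_margins(x, k, means, S_ginv):
    """Values of delta_kl' S^- (x - mu_kl) for all l != k."""
    out = []
    for l in range(means.shape[0]):
        if l != k:
            delta = means[k] - means[l]
            mid = (means[k] + means[l]) / 2.0
            out.append(delta @ S_ginv @ (x - mid))
    return np.array(out)

def classify(x, classes, means, S_ginv):
    """Class k whose pairwise margins are all nonnegative (argmax of linear scores)."""
    quad = np.einsum("ij,jk,ik->i", means, S_ginv, means)
    scores = means @ S_ginv @ x - 0.5 * quad
    return classes[int(np.argmax(scores))]
```

The score form avoids looping over all $K(K-1)$ pairs; `pairwise_margins` is kept only to make the connection to the rule as stated in the text explicit.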

5. Numerical studies. Golub et al. (1999) applied gene expression mi- 
croarray techniques to study human acute leukemia and discovered the dis- 
tinction between acute myeloid leukemia (AML) and acute lymphoblastic 
leukemia (ALL). Distinguishing ALL from AML is crucial for successful 
treatment, since chemotherapy regimens for ALL can be harmful for AML 
patients. An accurate classification based solely on gene expression moni- 
toring independent of previous biological knowledge is desired as a general 
strategy for discovering and predicting cancer classes. 

We considered a dataset that was used by many researchers [see, e.g., 
Fan and Fan (2008)]. It contains the expression levels of p = 7,129 genes for 
n = 72 patients. Patients in the sample are known to come from two distinct 
classes of leukemia: $n_1 = 47$ are from the ALL class, and $n_2 = 25$ are from
the AML class. 

Figure 1 displays the cumulative proportions defined as $\sum_{j=1}^{l} \hat\delta_{(j)}^2 / \|\hat\delta\|^2$,
$l = 1, \ldots, p$, where $\hat\delta_{(j)}^2$ is the $j$th largest value among the squared
components of $\hat\delta$. These proportions indicate the importance of the contribution of
each $\hat\delta_{(j)}$. It can be seen from Figure 1 that the first 1,000 $\hat\delta_{(j)}$'s contribute
a cumulative proportion nearly 98%. Figure 2 plots the absolute values of
the off-diagonal elements of the sample covariance matrix $S$. It can be seen
that many of them are relatively small. If we ignore a factor of $10^8$, then
among a total of 25,407,756 values in Figure 2, only 0.45% of them vary
from 0.35 to 9.7 and the rest of them are under 0.35.

Fig. 1. Cumulative proportions.
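
As a small illustration of the cumulative proportions plotted in Figure 1, the following sketch sorts the squared components of a mean-difference vector in decreasing order and accumulates their share of the squared norm. The function name `cumulative_proportions` is ours, not the paper's.

```python
import numpy as np

def cumulative_proportions(delta_hat):
    """Cumulative share of ||delta_hat||^2 carried by the l largest squared components."""
    sq = np.sort(np.asarray(delta_hat, dtype=float) ** 2)[::-1]  # squared components, largest first
    return np.cumsum(sq) / sq.sum()
```

For the vector $(3, -4, 0)$ the squared components are 16, 9 and 0, so the cumulative proportions are 0.64, 1 and 1.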

For the SLDA, to construct sparse estimates of $\delta$ and $\Sigma$ by thresholding,
we applied the cross-validation method described in the end of Section 3
to choose the constants $M_1$ and $M_2$ in the thresholding values
$t_n = M_1 (n^{-1} \log p)^{0.5}$ and $a_n = M_2 (n^{-1} \log p)^{0.3}$. Figure 3 shows the cross validation
scores $\hat R_{\mathrm{SLDA}}(M_1, M_2)$ over a range of $(M_1, M_2)$. The minimum cross
validation score is achieved at $M_1 = 10$ and $M_2 = 300$. These thresholding
values resulted in a $\tilde\delta$ with exactly 2,492 nonzero components, which is
about 35% of all components of $\hat\delta$, and a $\tilde S$ with exactly 227,083 nonzero
elements, which is about 0.45% of all elements of $S$. Note that the number of
nonzero estimates of $\delta$ is still much larger than $n = 72$, but the SLDA does
not require it to be smaller than $n$. The resulting SLDA has an estimated
(by cross validation) misclassification rate 0.0278. In fact, 1 of the 47 ALL
cases and 1 of the 25 AML cases are misclassified under the cross validation
evaluation of the SLDA.
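
The thresholding step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: hard thresholding is assumed (entries whose absolute value does not exceed the threshold are set to zero, in line with the events $\{|\hat\delta_j| > a_n\}$ used in the proof of Lemma 2), the diagonal of $S$ is assumed to be left untouched, and all function names are hypothetical.

```python
import numpy as np

def threshold_values(n, p, M1, M2):
    """Thresholds t_n = M1 (log p / n)^0.5 and a_n = M2 (log p / n)^0.3."""
    base = np.log(p) / n
    return M1 * base ** 0.5, M2 * base ** 0.3

def threshold_mean_diff(delta_hat, a_n):
    """Keep components of delta_hat with |delta_hat_j| > a_n, zero out the rest."""
    delta_hat = np.asarray(delta_hat, dtype=float)
    return np.where(np.abs(delta_hat) > a_n, delta_hat, 0.0)

def threshold_cov(S, t_n):
    """Keep entries of S with |s_jl| > t_n; the diagonal is restored (assumption)."""
    S = np.asarray(S, dtype=float)
    S_t = np.where(np.abs(S) > t_n, S, 0.0)
    np.fill_diagonal(S_t, np.diag(S))
    return S_t
```

In practice the constants `M1` and `M2` would be chosen over a grid by minimizing a cross validation score, as described in the text.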

For comparison, we carried out the LDA with a generalized inverse $S^{-}$.
In the leave-one-out cross-validation evaluation of the LDA, 2 of the 47 ALL
cases and 5 of the 25 AML cases are misclassified by the LDA, which results
in an estimated misclassification rate 0.0972. Compared with the LDA, the
SLDA reduces the misclassification rate by nearly 70%. From Figure 5 of Fan
and Fan (2008), the misclassification rate of the FAIR method, estimated
by the average of 100 randomly constructed cross validations with $\pi n$ data
points for constructing the classifier and $(1 - \pi)n$ data points for validation
($\pi = 0.4$, 0.5 and 0.6), ranges from 5% to 7%, which is smaller than the
misclassification rate of the LDA but larger than the misclassification rate
of the SLDA.
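
A generic leave-one-out evaluation of the kind used for the two rates above might look like the sketch below. The helper `loo_error` and its `fit`/`predict` callables are hypothetical placeholders for any classifier; the last three lines simply redo the arithmetic behind the reported rates from the counts given in the text.

```python
import numpy as np

def loo_error(X, y, fit, predict):
    """Leave-one-out misclassification rate: refit without case i, then predict case i."""
    n = len(y)
    errors = 0
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(X[keep], y[keep])
        errors += int(predict(model, X[i]) != y[i])
    return errors / n

# Arithmetic behind the reported rates (counts taken from the text).
slda_rate = (1 + 1) / 72               # rounds to 0.0278
lda_rate = (2 + 5) / 72                # rounds to 0.0972
reduction = 1 - slda_rate / lda_rate   # 5/7, roughly 71%
```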

We also performed a simulation study on the conditional misclassification
rate of SLDA under a population constructed using estimates from the real
data set and a smaller dimension $p = 1,714$. The smaller dimension was used
to reduce the computational cost and the 1,714 variables were chosen from
the 7,129 variables with p-values (of the two sample t-tests for the mean
effects) smaller than 0.05. In each of the 100 independently generated data
sets, independent $\{x_{1i}, i = 1, \ldots, 47\}$ and $\{x_{2i}, i = 1, \ldots, 25\}$ were generated
from $N_p(\mu_1, \Sigma)$ and $N_p(\mu_2, \Sigma)$, respectively, where $p = 1,714$ and $\mu_k$ and $\Sigma$
are estimates from the real data set. The sparse estimate $\tilde S$ was used instead
of the sample covariance matrix $S$, because $S$ is not positive definite. Since
the population means and covariance matrix are known in the simulation,
we were able to compute the conditional misclassification rate $R_{\mathrm{SLDA}}(X)$
for each generated data set. A boxplot of 100 values of $R_{\mathrm{SLDA}}(X)$ in the
simulation is given in Figure 4(a). The unconditional misclassification rate
of the SLDA can be approximated by averaging over the 100 conditional
misclassification rates. In this simulation, the unconditional misclassification
rate for the SLDA is 0.069. Since the population is known in simulation, the
optimal misclassification rate $R_{\mathrm{OPT}}$ is known to be 0.03.
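
When the population is known, the conditional misclassification rate of any linear rule has a closed form, which is what makes the computation above possible. The sketch below assumes a rule that assigns $x$ to class 1 when $w'(x - m) \geq 0$; taking $w = S^{-1}\hat\delta$ and $m = (\bar x_1 + \bar x_2)/2$ gives the LDA rate in formula (5). The function names are hypothetical.

```python
import math
import numpy as np

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def conditional_rate(w, m, mu1, mu2, Sigma):
    """Average of the two class-wise error probabilities of the rule w'(x - m) >= 0."""
    s = math.sqrt(w @ Sigma @ w)                # standard deviation of w'x under either class
    err1 = norm_cdf(-(w @ (mu1 - m)) / s)       # class 1 case falls on the class 2 side
    err2 = norm_cdf((w @ (mu2 - m)) / s)        # class 2 case falls on the class 1 side
    return 0.5 * (err1 + err2)
```

Averaging `conditional_rate` over independently generated training sets then approximates the unconditional rate, as done in the simulation.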

For comparison, in the simulation we computed the conditional misclassification
rates, $R_{\mathrm{LDA}}(X)$ for the LDA and $R_{\mathrm{SCRDA}}(X)$ for the shrunken centroids
regularized discriminant analysis (SCRDA) proposed by Guo, Hastie
and Tibshirani (2007). Since $R_{\mathrm{SCRDA}}(X)$ does not have an explicit form, it is
approximated by an independent test data set of size 100 in each simulation
run. Boxplots of $R_{\mathrm{LDA}}(X)$ and $R_{\mathrm{SCRDA}}(X)$ for 100 simulated data sets are
included in Figure 4(a). It can be seen that the conditional misclassification
rate of the LDA varies more than that of the SLDA. The unconditional
misclassification rate for the LDA, approximated by the 100 simulated $R_{\mathrm{LDA}}(X)$
values, is 0.152, which indicates a 53% improvement of the SLDA over the
LDA in terms of the unconditional misclassification rate. The SCRDA has
a simulated unconditional misclassification rate 0.137 and its performance
is better than that of the LDA but worse than that of the SLDA. In this
simulation, we also found that the conditional misclassification rate of the
FAIR method was similar to that of the LDA.

Fig. 4. Box-plots of conditional misclassification rates of SLDA, SCRDA and LDA.

To examine the performance of these classification methods in the case of
nonnormal data, we repeated the same simulation with the multivariate normal
distribution replaced by the multivariate t-distribution with 3 degrees
of freedom. The boxplots are given in Figure 4(b) and the simulated
unconditional misclassification rates are 0.059, 0.194 and 0.399 for the SLDA,
SCRDA and LDA, respectively. Since the t-distribution has a larger
variability than the normal distribution, all conditional misclassification rates in
the t-distribution case vary more than those in the normal distribution case.
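
One standard way to generate multivariate t data for such a comparison is to divide a multivariate normal draw by the square root of an independent scaled chi-square variable. The sketch below is an illustration only: the function name is hypothetical, and it assumes that the matrix passed in acts as the scale matrix (so that the covariance is `df / (df - 2)` times that matrix). The text does not say which parameterization was used.

```python
import numpy as np

def sample_multivariate_t(rng, mean, scale, df, size):
    """Draw `size` vectors from a multivariate t with location `mean` and scale matrix `scale`."""
    mean = np.asarray(mean, dtype=float)
    z = rng.multivariate_normal(np.zeros(mean.shape[0]), scale, size=size)  # N(0, scale) draws
    w = rng.chisquare(df, size=size) / df                                   # chi-square / df mixing variable
    return mean + z / np.sqrt(w)[:, None]
```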

6. Proofs.

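Proof of Theorem 1. (i) Let $\hat\sigma_{jl}$ and $\sigma_{jl}$ be the $(j, l)$th elements
of $S$ and $\Sigma$, respectively. From result (10) in Bickel and Levina (2008),
$\max_{j,l \leq p} |\hat\sigma_{jl} - \sigma_{jl}| = O_P(\sqrt{\log p}/\sqrt{n})$. Then,

\[
\|S - \Sigma\| \leq \max_{j \leq p} \sum_{l=1}^{p} |\hat\sigma_{jl} - \sigma_{jl}| = O_P(p\sqrt{\log p}/\sqrt{n}) = O_P(s_n),
\]

where $\|A\|$ is the norm of the matrix $A$ defined as the maximum of all
eigenvalues of $A$. By (2)-(3) and $s_n \to 0$, $S^{-1}$ exists and

\[
\|S^{-1} - \Sigma^{-1}\| = \|S^{-1}(S - \Sigma)\Sigma^{-1}\| \leq \|S^{-1}\| \, \|S - \Sigma\| \, \|\Sigma^{-1}\| = O_P(s_n).
\]

Consequently,

\[
\hat\delta' S^{-1} \Sigma S^{-1} \hat\delta = \hat\delta' S^{-1} \hat\delta \, [1 + O_P(s_n)] = \hat\delta' \Sigma^{-1} \hat\delta \, [1 + O_P(s_n)].
\]

Since $E[(\hat\delta - \delta)' \Sigma^{-1} (\hat\delta - \delta)] = O(p/n)$ and
$E\{[\delta' \Sigma^{-1} (\hat\delta - \delta)]^2\} \leq \Delta_p^2 E[(\hat\delta - \delta)' \Sigma^{-1} (\hat\delta - \delta)]$, we have

\[
\begin{aligned}
\hat\delta' \Sigma^{-1} \hat\delta
&= \delta' \Sigma^{-1} \delta + 2\delta' \Sigma^{-1} (\hat\delta - \delta) + (\hat\delta - \delta)' \Sigma^{-1} (\hat\delta - \delta) \\
&= \Delta_p^2 \left[ 1 + O_P\!\left( \frac{\sqrt{p}}{\sqrt{n}\,\Delta_p} \right) + O_P\!\left( \frac{p}{n \Delta_p^2} \right) \right] \\
&= \Delta_p^2 [1 + O_P(s_n)],
\end{aligned}
\]

where the last equality follows from $\sqrt{p}/(s_n \sqrt{n} \Delta_p) = 1/(\sqrt{p \log p}\, \Delta_p) = O(1)$.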
Combining these results, we obtain that 

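\[
\hat\delta' S^{-1} \hat\delta = \hat\delta' \Sigma^{-1} \hat\delta \, [1 + O_P(s_n)] = \Delta_p^2 [1 + O_P(s_n)]^2 = \Delta_p^2 [1 + O_P(s_n)].
\]

Then

\[
\begin{aligned}
\frac{\hat\delta' S^{-1} (\bar x_1 - \mu_1) - \hat\delta' S^{-1} \hat\delta / 2}{\sqrt{\hat\delta' S^{-1} \Sigma S^{-1} \hat\delta}}
&= -\frac{\Delta_p^2 [1 + O_P(s_n)]}{2 \Delta_p \sqrt{1 + O_P(s_n)}} + O_P\!\left( \sqrt{p/n} \right) \\
&= -\frac{\Delta_p}{2} [1 + O_P(s_n)] + O_P\!\left( \sqrt{p/n} \right) \\
&= -\frac{\Delta_p}{2} \left[ 1 + O_P(s_n) + O_P\!\left( \frac{\sqrt{p}}{\sqrt{n}\,\Delta_p} \right) \right] \\
&= -\frac{\Delta_p}{2} [1 + O_P(s_n)].
\end{aligned}
\]

Similarly, we can show that

\[
\frac{\hat\delta' S^{-1} (\mu_2 - \bar x_2) - \hat\delta' S^{-1} \hat\delta / 2}{\sqrt{\hat\delta' S^{-1} \Sigma S^{-1} \hat\delta}} = -\frac{\Delta_p}{2} [1 + O_P(s_n)].
\]

These results and formula (5) imply the result in (i).

(ii) Let $\phi$ be the density of $\Phi$. By the result in (i),

\[
R_{\mathrm{LDA}}(X) - R_{\mathrm{OPT}} = \phi(\omega_n) O_P(s_n),
\]

where $\omega_n$ is between $-\Delta_p/2$ and $-[1 + O_P(s_n)]\Delta_p/2$. Since $\phi(\omega_n)$ is bounded
by a constant, the result follows from the fact that $R_{\mathrm{OPT}}$ is bounded away
from 0 when $\Delta_p$ is bounded.

(iii) When $\Delta_p \to \infty$, $R_{\mathrm{OPT}} \to 0$, and, by the result in (i), $R_{\mathrm{LDA}}(X) \to_P 0$.

(iv) If $\Delta_p \to \infty$, then, by Lemma 1 and the condition $s_n \Delta_p^2 \to 0$, we conclude
that $R_{\mathrm{LDA}}(X)/R_{\mathrm{OPT}} \to_P 1$. □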

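Proof of Lemma 1. It follows from result (22) that

\[
\frac{\xi_n (1 - \tau_n)}{1 + \xi_n (1 - \tau_n)^2} \, e^{[\xi_n - \xi_n (1 - \tau_n)^2]/2}
\leq \frac{\Phi(-\sqrt{\xi_n}\,(1 - \tau_n))}{\Phi(-\sqrt{\xi_n})}
\leq \frac{1 + \xi_n}{\xi_n (1 - \tau_n)} \, e^{[\xi_n - \xi_n (1 - \tau_n)^2]/2}.
\]

Since $\xi_n \to \infty$ and $\tau_n \to 0$,

\[
\frac{\xi_n (1 - \tau_n)}{1 + \xi_n (1 - \tau_n)^2} \to 1
\quad \text{and} \quad
\frac{1 + \xi_n}{\xi_n (1 - \tau_n)} \to 1.
\]

The result follows from $[\xi_n - \xi_n (1 - \tau_n)^2]/2 = \xi_n \tau_n (1 - \tau_n/2) \to \gamma$ regardless
of whether $\gamma$ is 0, positive, or $\infty$. □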
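
The two-sided bound in this proof is easy to check numerically. The sketch below is an illustration with hypothetical function names: it evaluates the lower bound, the tail ratio and the upper bound for given $\xi_n$ and $\tau_n$, using the complementary error function so that very small tail probabilities stay accurate.

```python
import math

def upper_tail(x):
    """Phi(-x) for the standard normal distribution."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def lemma1_sandwich(xi, tau):
    """Lower bound, ratio Phi(-sqrt(xi)(1 - tau)) / Phi(-sqrt(xi)), and upper bound."""
    ratio = upper_tail(math.sqrt(xi) * (1.0 - tau)) / upper_tail(math.sqrt(xi))
    expo = math.exp((xi - xi * (1.0 - tau) ** 2) / 2.0)
    lower = xi * (1.0 - tau) / (1.0 + xi * (1.0 - tau) ** 2) * expo
    upper = (1.0 + xi) / (xi * (1.0 - tau)) * expo
    return lower, ratio, upper
```

With $\xi_n = 400$ and $\tau_n = 1/400$, so that $\xi_n \tau_n = 1$, the ratio lies between the two bounds and is within one percent of $e$.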

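Proof of Theorem 2. For simplicity, we prove the case of $n_1 = n_2 = n/2$.

(i) The conditional misclassification rate of the LDA in this case is given
by (5) with $S$ replaced by $\Sigma$. Note that $\Sigma^{-1/2}(\bar x_k - \mu_k) \sim N_p(0, n_k^{-1} I)$, where
$I$ is the identity matrix of order $p$. Let $\zeta_j$ be the $j$th component of $\Sigma^{-1/2}\delta$.
Then, $\sum_{j=1}^{p} \zeta_j^2 = \Delta_p^2$, the $j$th component of $\Sigma^{-1/2}(\bar x_k - \mu_k)$ is $n_k^{-1/2} \xi_{kj}$,
and the $j$th component of $\Sigma^{-1/2}\hat\delta$ is $\zeta_j + n_1^{-1/2}(\xi_{1j} - \xi_{2j})$, $j = 1, \ldots, p$, where
$\xi_{kj}$, $j = 1, \ldots, p$, $k = 1, 2$, are independent standard normal random variables.
Consequently,

\[
\hat\delta' \Sigma^{-1} (\bar x_1 - \mu_1) - \hat\delta' \Sigma^{-1} \hat\delta / 2
= \sum_{j=1}^{p} \left( -\frac{\zeta_j^2}{2} + \frac{\zeta_j \xi_{2j}}{\sqrt{n_1}} + \frac{\xi_{1j}^2 - \xi_{2j}^2}{2 n_1} \right)
\]

and

\[
\begin{aligned}
\hat\delta' \Sigma^{-1} \hat\delta
&= \Delta_p^2 + \frac{1}{n_1} \sum_{j=1}^{p} (\xi_{1j} - \xi_{2j})^2 + \frac{2}{\sqrt{n_1}} \sum_{j=1}^{p} \zeta_j (\xi_{1j} - \xi_{2j}) \\
&= \Delta_p^2 + \frac{4p}{n} [1 + o_P(1)] + O_P\!\left( \frac{\Delta_p}{\sqrt{n}} \right) \\
&= \Delta_p^2 + \frac{4p}{n} [1 + o_P(1)],
\end{aligned}
\]

where the last equality follows from $\Delta_p^2 = O(p)$ under (2)-(3). Combining
these results, we obtain that

\[
\frac{\hat\delta' \Sigma^{-1} (\bar x_1 - \mu_1) - \hat\delta' \Sigma^{-1} \hat\delta / 2}{\sqrt{\hat\delta' \Sigma^{-1} \hat\delta}}
= -\frac{\Delta_p^2}{2 \sqrt{\Delta_p^2 + (4p/n)[1 + o_P(1)]}} + o_P(1). \tag{25}
\]

Similarly, we can prove that (25) still holds if $\bar x_1 - \mu_1$ is replaced by $\mu_2 - \bar x_2$.
If $\Delta_p^2/\sqrt{p/n} \to 0$, then the quantity in (25) converges to 0 in probability.
Hence, $R_{\mathrm{LDA}}(X) \to_P 1/2$.

(ii) Since $p/n \to \infty$, $\Delta_p^2/(p/n) \to 0$. Then, the quantity in (25) converges
to $-c/4$ in probability and, hence, $R_{\mathrm{LDA}}(X) \to_P \Phi(-c/4)$, which
is a constant between 0 and 1/2. Since $\Delta_p \to \infty$, $R_{\mathrm{OPT}} \to 0$ and, hence,
$R_{\mathrm{LDA}}(X)/R_{\mathrm{OPT}} \to_P \infty$.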

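(iii) When $\Delta_p^2/\sqrt{p/n} \to \infty$, it follows from (25) that the quantity on
the left-hand side of (25) diverges to $-\infty$ in probability. This proves that
$R_{\mathrm{LDA}}(X) \to_P 0$. To show $R_{\mathrm{LDA}}(X)/R_{\mathrm{OPT}} \to_P \infty$, we need a more refined
analysis. The quantity on the left-hand side of (25) is equal to

\[
-\frac{\Delta_p^2 + O_P(\sqrt{p}/n) + O_P(\Delta_p/\sqrt{n})}{2 \sqrt{\Delta_p^2 + (4p/n)[1 + o_P(1)]}} = -\frac{\Delta_p}{2} (1 - \tau_n),
\]

where

\[
\tau_n = 1 - \frac{\Delta_p + O_P(\sqrt{p}/n)/\Delta_p + O_P(1/\sqrt{n})}{\sqrt{\Delta_p^2 + (4p/n)[1 + o_P(1)]}}
\]

and $P(0 < \tau_n < 1) \to 1$. Note that $\tau_n = \tau_{1n} + \tau_{2n}$, where

\[
\tau_{1n} = 1 - \frac{\Delta_p}{\sqrt{\Delta_p^2 + (4p/n)[1 + o_P(1)]}}
= \frac{(4p/n)[1 + o_P(1)]}{\Delta_p^2 + (4p/n)[1 + o_P(1)] + \Delta_p \sqrt{\Delta_p^2 + (4p/n)[1 + o_P(1)]}}
\]

and

\[
\tau_{2n} = -\frac{O_P(\sqrt{p}/n)/\Delta_p + O_P(1/\sqrt{n})}{\sqrt{\Delta_p^2 + (4p/n)[1 + o_P(1)]}}
= \frac{O_P(\sqrt{p}/n)}{\Delta_p^2} + \frac{O_P(1/\sqrt{n})}{\Delta_p}
= \frac{O_P(\sqrt{p/n})}{\Delta_p^2}
\]

under (2) and (3). Then

\[
\tau_n \Delta_p^2 = \tau_{1n} \Delta_p^2 + \tau_{2n} \Delta_p^2 = \tau_{1n} \Delta_p^2 + O_P(\sqrt{p/n}).
\]

If $\Delta_p^2/(p/n)$ is bounded, then $\tau_{1n} \geq c$ for a constant $c > 0$ and

\[
\tau_n \Delta_p^2 \geq c \Delta_p^2 + O_P(\sqrt{p/n}),
\]

which diverges to $\infty$ in probability since $\Delta_p^2/\sqrt{p/n} \to \infty$. If $\Delta_p^2/(p/n) \to \infty$,
then $\tau_{1n} \Delta_p^2 \geq cp/n$ for a constant $c > 0$ and

\[
\tau_n \Delta_p^2 \geq cp/n + O_P(\sqrt{p/n}),
\]

which diverges to $\infty$ in probability since $p/n \to \infty$. Thus, $\tau_n \Delta_p^2 \to \infty$ in
probability, and the result follows from Lemma 1. □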
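
The three regimes in this proof can be illustrated by evaluating the leading term of (25) with the $o_P(1)$ terms dropped, and comparing the resulting approximate rate with $R_{\mathrm{OPT}} = \Phi(-\Delta_p/2)$. The sketch below is a numerical illustration only; the function names are hypothetical.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def lda_rate_known_sigma(delta_sq, p, n):
    """Phi of the leading term of (25): -Delta_p^2 / (2 sqrt(Delta_p^2 + 4p/n))."""
    return norm_cdf(-delta_sq / (2.0 * math.sqrt(delta_sq + 4.0 * p / n)))

def optimal_rate(delta_sq):
    """R_OPT = Phi(-Delta_p / 2)."""
    return norm_cdf(-math.sqrt(delta_sq) / 2.0)
```

For example, with $p/n = 10^8$ and $\Delta_p^2 = 2\sqrt{p/n}$ the approximate rate is close to $\Phi(-1/2)$, matching case (ii) with $c = 2$, while $R_{\mathrm{OPT}}$ is already negligible.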

Proof of Lemma 2. (i) It follows from (22) that, for all $t$,

\[
P(|\hat\delta_j - \delta_j| > t) \leq c_1 e^{-c_2 n t^2},
\]

where $c_1$ and $c_2$ are positive constants. Then, the probability in (11) is

\[
1 - P\Biggl( \bigcup_{1 \leq j \leq p,\ |\delta_j| \leq a_n/r} \{ |\hat\delta_j| > a_n \} \Biggr)
\geq 1 - \sum_{j=1}^{p} P\bigl( |\hat\delta_j - \delta_j| > a_n (r-1)/r \bigr)
\geq 1 - p c_1 e^{-c_2 n a_n^2 (r-1)^2/r^2}.
\]

Because

\[
\frac{n a_n^2}{\log p} = M_2^2 \left( \frac{n}{\log p} \right)^{1 - 2\alpha} \to \infty
\]

when $\alpha < 1/2$, we conclude that $p c_1 e^{-c_2 n a_n^2 (r-1)^2/r^2} \to 0$, and thus (11) holds.
The proof of (12) is similar since

\[
1 - P\Biggl( \bigcup_{1 \leq j \leq p,\ |\delta_j| > r a_n} \{ |\hat\delta_j| \leq a_n \} \Biggr)
\geq 1 - \sum_{j=1}^{p} P\bigl( |\hat\delta_j - \delta_j| > a_n (r-1) \bigr)
\geq 1 - p c_1 e^{-c_2 n a_n^2 (r-1)^2}.
\]

(ii) The result follows from results (11) and (12). □
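
The lower bound in part (i) can be evaluated numerically to see how it behaves as $n$ grows. In the sketch below the function name is hypothetical and the default values of `c1` and `c2` are purely illustrative assumptions, since the proof only states that such positive constants exist.

```python
import math

def support_bound(n, p, M2, alpha, r, c1=2.0, c2=0.125):
    """Lower bound 1 - p c1 exp(-c2 n a_n^2 (r-1)^2 / r^2) with a_n = M2 (log p / n)^alpha."""
    a_n = M2 * (math.log(p) / n) ** alpha
    expo = c2 * n * a_n ** 2 * (r - 1.0) ** 2 / r ** 2
    return 1.0 - p * c1 * math.exp(-expo)
```

For small $n$ the bound can be negative and hence uninformative; it approaches 1 as $n a_n^2/\log p$ grows, which is the content of the displayed limit.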

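Proof of Theorem 3. The conditional misclassification rate $R_{\mathrm{SLDA}}(X)$
is given by

\[
\frac{1}{2} \sum_{k=1}^{2} \Phi\left( \frac{(-1)^k \tilde\delta' \tilde S^{-1} (\mu_k - \bar x_k) - \tilde\delta' \tilde S^{-1} \tilde\delta / 2}{\sqrt{\tilde\delta' \tilde S^{-1} \Sigma \tilde S^{-1} \tilde\delta}} \right).
\]

From result (8),

\[
\tilde\delta' \tilde S^{-1} \Sigma \tilde S^{-1} \tilde\delta = \tilde\delta' \tilde S^{-1} \tilde\delta \, [1 + O_P(d_n)] = \tilde\delta' \Sigma^{-1} \tilde\delta \, [1 + O_P(d_n)].
\]

Without loss of generality, we assume that $\tilde\delta = (\tilde\delta_1', 0')'$, where $\tilde\delta_1$ is the
$\hat q_n$-vector containing nonzero components of $\tilde\delta$. Let $\delta = (\delta_1', \delta_0')'$, where $\delta_1$ has
dimension $\hat q_n$. From Lemma 2(ii), $\|\tilde\delta_1 - \delta_1\|^2 = O_P(q_n/n)$ and, with probability
tending to 1,

\[
\|\delta_0\|^2 = \sum_{j : |\hat\delta_j| \leq a_n} \delta_j^2
\leq \sum_{j : |\delta_j| \leq r a_n} \delta_j^2
\leq (r a_n)^{2(1-g)} \sum_{j : |\delta_j| \leq r a_n} \delta_j^{2g}
= O(a_n^{2(1-g)} D_{g,p}).
\]

Let $k_n = \max\{a_n^{2(1-g)} D_{g,p}, q_n/n\}$. Then $\|\tilde\delta - \delta\|^2 = \|\tilde\delta_1 - \delta_1\|^2 + \|\delta_0\|^2 = O_P(k_n)$.
This together with (2)-(3) implies that $(\tilde\delta - \delta)' \Sigma^{-1} (\tilde\delta - \delta) = O_P(k_n)$,
and hence

\[
\begin{aligned}
\tilde\delta' \Sigma^{-1} \tilde\delta
&= \Delta_p^2 + 2\delta' \Sigma^{-1} (\tilde\delta - \delta) + (\tilde\delta - \delta)' \Sigma^{-1} (\tilde\delta - \delta) \\
&= \Delta_p^2 [1 + O_P(\sqrt{k_n}/\Delta_p) + O_P(k_n/\Delta_p^2)] \\
&= \Delta_p^2 [1 + O_P(\sqrt{k_n}/\Delta_p)].
\end{aligned}
\]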

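Write

\[
\Sigma = \begin{pmatrix} \Sigma_1 & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_2 \end{pmatrix}, \quad
\tilde S = \begin{pmatrix} \tilde S_1 & \tilde S_{12} \\ \tilde S_{12}' & \tilde S_2 \end{pmatrix}, \quad
\Sigma^{-1} = \begin{pmatrix} C_1 & C_{12} \\ C_{12}' & C_2 \end{pmatrix}, \quad
\tilde S^{-1} = \begin{pmatrix} \tilde C_1 & \tilde C_{12} \\ \tilde C_{12}' & \tilde C_2 \end{pmatrix},
\]

where $\Sigma_1$, $\tilde S_1$, $C_1$ and $\tilde C_1$ are $\hat q_n \times \hat q_n$ matrices with $\hat q_n$ defined in Lemma 2(ii).
Then

\[
\tilde C_{12} = -\tilde S_1^{-1} \tilde S_{12} \tilde C_2 \quad \text{and} \quad C_{12} = -\Sigma_1^{-1} \Sigma_{12} C_2.
\]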

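If $\tilde\delta = (\tilde\delta_1', 0')'$ and $\bar x_1 - \mu_1 = (\xi_1', \xi_0')'$, where $\tilde\delta_1$ and $\xi_1$ have dimension $\hat q_n$,
then

\[
\tilde\delta' \tilde S^{-1} (\bar x_1 - \mu_1) = \tilde\delta_1' \tilde C_1 \xi_1 + \tilde\delta_1' \tilde C_{12} \xi_0
= \tilde\delta_1' \tilde C_1 \xi_1 - \tilde\delta_1' \tilde S_1^{-1} \tilde S_{12} \tilde C_2 \xi_0.
\]

Since $\xi_1$ has dimension $\hat q_n$,

\[
(\tilde\delta_1' \tilde C_1 \xi_1)^2 \leq (\xi_1' \tilde C_1 \xi_1)(\tilde\delta_1' \tilde C_1 \tilde\delta_1)
= (\xi_1' \tilde C_1 \xi_1)(\tilde\delta' \tilde S^{-1} \tilde\delta)
= O_P(q_n/n)(\tilde\delta' \tilde S^{-1} \tilde\delta)
\]

and hence

\[
\frac{\tilde\delta_1' \tilde C_1 \xi_1}{\sqrt{\tilde\delta' \tilde S^{-1} \tilde\delta}} = O_P(\sqrt{q_n/n}) = O_P(\sqrt{k_n}).
\]

Since $\tilde S_1^{-1} \leq \tilde C_1$,

\[
\begin{aligned}
(\tilde\delta_1' \tilde S_1^{-1} \tilde S_{12} \tilde C_2 \xi_0)^2
&\leq (\tilde\delta_1' \tilde S_1^{-1} \tilde\delta_1)(\xi_0' \tilde C_2 \tilde S_{12}' \tilde S_1^{-1} \tilde S_{12} \tilde C_2 \xi_0) \\
&\leq (\tilde\delta_1' \tilde C_1 \tilde\delta_1)(\xi_0' \tilde C_2 \tilde S_{12}' \tilde S_1^{-1} \tilde S_{12} \tilde C_2 \xi_0) \\
&= (\tilde\delta' \tilde S^{-1} \tilde\delta)(\xi_0' \tilde C_2 \tilde S_{12}' \tilde S_1^{-1} \tilde S_{12} \tilde C_2 \xi_0).
\end{aligned}
\]

From result (8),

\[
\xi_0' \tilde C_2 \tilde S_{12}' \tilde S_1^{-1} \tilde S_{12} \tilde C_2 \xi_0 = \xi_0' C_2 \Sigma_{12}' \Sigma_1^{-1} \Sigma_{12} C_2 \xi_0 \, [1 + O_P(d_n)].
\]

Under condition (2), all eigenvalues of sub-matrices of $\Sigma$ and $\Sigma^{-1}$ are
bounded by $c_0$. Repeatedly using condition (2), we obtain that

\[
\begin{aligned}
E(\xi_0' C_2 \Sigma_{12}' \Sigma_1^{-1} \Sigma_{12} C_2 \xi_0)
&\leq c_0 E(\xi_0' C_2 \Sigma_{12}' \Sigma_{12} C_2 \xi_0) \\
&= c_0 n_1^{-1} \operatorname{trace}(\Sigma_{12} C_2 \Sigma_2 C_2 \Sigma_{12}') \\
&\leq c_0^4 n_1^{-1} \operatorname{trace}(\Sigma_{12} \Sigma_{12}') \\
&= \frac{c_0^4}{n_1} \sum_{j=1}^{\hat q_n} \sum_{l=\hat q_n + 1}^{p} \sigma_{jl}^2 \\
&\leq \frac{c_0^{6-h} q_n}{n_1} \max_{j \leq p} \sum_{l=1}^{p} |\sigma_{jl}|^h \\
&= O(C_{h,p} q_n/n),
\end{aligned}
\]

where $h$ and $C_{h,p}$ are given in (6). This proves that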



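\[
\frac{\tilde\delta' \tilde S^{-1} (\bar x_1 - \mu_1)}{\sqrt{\tilde\delta' \tilde S^{-1} \tilde\delta}} = O_P(\sqrt{k_n}) + O_P\bigl( \sqrt{C_{h,p} q_n/n} \bigr),
\]

which also holds when $\bar x_1 - \mu_1$ is replaced by $\bar x_2 - \mu_2$ or $\tilde\delta - \delta$. Note that

\[
\begin{aligned}
\tilde\delta' \tilde S^{-1} \tilde\delta
&= \delta' \tilde S^{-1} \delta + (\tilde\delta - \delta)' \tilde S^{-1} \delta + (\tilde\delta - \delta)' \tilde S^{-1} \tilde\delta \\
&= \delta' \tilde S^{-1} \delta + (\tilde\delta - \delta)' \tilde S^{-1} \delta + \Delta_p O_P(\sqrt{k_n}).
\end{aligned}
\]

Therefore,

\[
\begin{aligned}
\frac{(-1)^k \tilde\delta' \tilde S^{-1} (\mu_k - \bar x_k) - \tilde\delta' \tilde S^{-1} \tilde\delta / 2}{\sqrt{\tilde\delta' \tilde S^{-1} \Sigma \tilde S^{-1} \tilde\delta}}
&= \frac{O_P(\sqrt{k_n}) + O_P\bigl( \sqrt{C_{h,p} q_n/n} \bigr)}{\sqrt{1 + O_P(d_n)}}
 - \frac{\Delta_p \sqrt{1 + O_P(\sqrt{k_n}/\Delta_p)}}{2 \sqrt{1 + O_P(d_n)}} \\
&= O_P(\sqrt{k_n}) + O_P\bigl( \sqrt{C_{h,p} q_n/n} \bigr)
 - \frac{\Delta_p}{2} \bigl[ 1 + O_P(\sqrt{k_n}/\Delta_p) + O_P(d_n) \bigr] \\
&= -\frac{\Delta_p}{2} \left[ 1 + O_P\!\left( \frac{\sqrt{k_n}}{\Delta_p} \right) + O_P\!\left( \frac{\sqrt{C_{h,p} q_n}}{\Delta_p \sqrt{n}} \right) + O_P(d_n) \right] \\
&= -\frac{\Delta_p}{2} [1 + O_P(b_n)].
\end{aligned}
\]

This proves the result in (i). The proofs of (ii)-(iv) are the same as the
proofs for Theorem 1(ii)-(iv) with $s_n$ replaced by $b_n$. This completes the
proof. □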
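
The two block-matrix facts used in the proof of Theorem 3, namely that the off-diagonal block of the inverse equals $-M_1^{-1} M_{12} C_2$ and that the leading block $C_1$ of the inverse dominates $M_1^{-1}$, can be checked numerically for any symmetric positive definite matrix. The sketch below is only a sanity check; the function name and the tolerance are our own choices.

```python
import numpy as np

def check_block_inverse(M, q):
    """Check C_12 = -M_1^{-1} M_12 C_2 and that C_1 - M_1^{-1} is positive semidefinite."""
    C = np.linalg.inv(M)
    M1, M12 = M[:q, :q], M[:q, q:]
    C1, C12, C2 = C[:q, :q], C[:q, q:], C[q:, q:]
    M1_inv = np.linalg.inv(M1)
    identity_ok = np.allclose(C12, -M1_inv @ M12 @ C2)
    gap = C1 - M1_inv
    gap = (gap + gap.T) / 2.0                       # symmetrize before the eigenvalue check
    psd_ok = np.linalg.eigvalsh(gap).min() >= -1e-10
    return bool(identity_ok), bool(psd_ok)
```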